(Image credit: Hadley Wickham)
R!
R and the Tidyverse
R without any packages.
R is…not a good programming language.
If you have prior experience in R and did not begin all your scripts with library(dplyr)….

Filter(), sort(), and select()
x %>% f() \(\Longleftrightarrow\) f(x)
“Take x, and then do f to it”
x %>% f(y) \(\Longleftrightarrow\) f(x,y)
x %>% f(y) %>% g(z) \(\Longleftrightarrow\) g(f(x,y),z)
“Take x, then do f with option y, then do g with option z…”

# familiar
listings %>% glimpse() # = glimpse(listings)
listings %>% head() # = head(listings)
listings %>% colnames() # = colnames(listings)
# get all columns with "review_scores" in the column name
listings %>% select(contains('review_scores'))
# what should this return?
listings %>% select(contains('review_scores')) %>% colnames()
# compare: colnames(select(listings, contains('review_scores')))

Let’s try this out – back to the case study!
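The pipe equivalences can be checked directly. The sketch below assumes dplyr is installed and uses a small made-up table, `toy`, as a stand-in for `listings`; the column values are hypothetical.

```r
library(dplyr)

# Hypothetical stand-in for the listings table
toy <- tibble(
  id = c(1, 2, 3),
  review_scores_rating = c(94, 98, NA),
  review_scores_value = c(9, 10, NA)
)

# x %>% f() is the same call as f(x)
same_one_step <- identical(toy %>% colnames(), colnames(toy))

# x %>% f(y) is the same call as f(x, y)
same_with_option <- identical(toy %>% head(2), head(toy, 2))

# x %>% f(y) %>% g(z) is the same call as g(f(x, y), z)
piped  <- toy %>% select(contains('review_scores')) %>% head(2)
nested <- head(select(toy, contains('review_scores')), 2)
same_two_steps <- identical(piped, nested)
```

Each of the three comparisons should evaluate to TRUE. The piped form reads left to right in the order the steps happen, while the nested form has to be read inside out.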
Go from this:
| id | neighbourhood_cleansed | review_scores_rating |
|---|---|---|
| 12147973 | Roslindale | NA |
| 3075044 | Roslindale | 94 |
| 6976 | Roslindale | 98 |
| 1436513 | Roslindale | 100 |
| 7651065 | Roslindale | 99 |
| 12386020 | Roslindale | 100 |
…to this:
| neighbourhood_cleansed | n | mean_rating |
|---|---|---|
| Leather District | 5 | 98.33333 |
| Roslindale | 56 | 95.38000 |
| West Roxbury | 46 | 95.21212 |
| South Boston Waterfront | 83 | 94.43103 |
| Jamaica Plain | 343 | 94.15932 |
| Longwood Medical Area | 9 | 94.00000 |
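One possible route from the first table to the second, sketched on a made-up table. `toy_listings` and its values are hypothetical; the column names follow the tables above. The sketch assumes that `n` counts every listing while the mean skips missing ratings, which is consistent with the numbers shown but is not confirmed by the source.

```r
library(dplyr)

# Hypothetical stand-in for the listings table
toy_listings <- tibble(
  id = 1:6,
  neighbourhood_cleansed = c('Roslindale', 'Roslindale', 'Roslindale',
                             'Jamaica Plain', 'Jamaica Plain', 'Jamaica Plain'),
  review_scores_rating = c(NA, 94, 98, 90, 100, 95)
)

rating_by_neighbourhood <- toy_listings %>%
  group_by(neighbourhood_cleansed) %>%                       # one group per neighbourhood
  summarise(
    n = n(),                                                 # listings in the group, NA ratings included
    mean_rating = mean(review_scores_rating, na.rm = TRUE)   # ignore missing ratings
  ) %>%
  arrange(desc(mean_rating))                                 # best-rated neighbourhood first
```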
# Compute a summary statistic
data %>% summarise(measure = formula(col1, col2))
# Make a new column
data %>% mutate(new_col = formula(old_col1, old_col2))
# Compute a summary statistic for each group
data %>%
  mutate(group_col = formula(old_col1, old_col2)) %>%
  group_by(group_col) %>%
  summarise(measure = formula(col1, col2))

filter and summarise
joining data

## # A tibble: 1 x 2
## earliest latest
## <date> <date>
## 1 2016-09-06 2017-09-05
But some of these listings may be “zombies” without recent availability. How can we include only listings with availability from a certain time period?
calendar table (exercise)
join that information to the listings table (together)

The information we need is distributed between two tables – how can we get there?
We need a key column that tells us which calendar rows correspond to which listings.
listings$id corresponds to calendar$listing_id
join
The join family of functions lets us add columns from one table to another using a key.
x %>% left_join(y) : most common, keeps all rows of x but not necessarily y.
x %>% right_join(y) : keeps all rows of y but not necessarily x.
x %>% full_join(y) : keeps all rows of both x and y.
x %>% inner_join(y) : keeps only rows of x that match in y and vice versa.

We’ll use left_join for this case – let’s try it in the case study.
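A minimal sketch of how the join variants differ, using two made-up tables. `listings_toy`, `calendar_toy`, the key column name `listing_id`, and the `available` column with values "t" and "f" are all assumptions made for illustration; the real case-study tables may differ.

```r
library(dplyr)

# Hypothetical stand-ins for the listings and calendar tables
listings_toy <- tibble(
  id = c(1, 2, 3),
  neighbourhood = c('Roslindale', 'Roslindale', 'Jamaica Plain')
)

calendar_toy <- tibble(
  listing_id = c(1, 1, 2, 4),
  date = as.Date(c('2016-09-06', '2016-09-07', '2016-09-06', '2016-09-06')),
  available = c('t', 'f', 'f', 't')
)

# Step 1: collapse the calendar to one row per listing
availability <- calendar_toy %>%
  group_by(listing_id) %>%
  summarise(days_available = sum(available == 't'))

# Step 2: join on the key; the key has a different name in each table
joined_left  <- listings_toy %>% left_join(availability,  by = c('id' = 'listing_id'))
joined_inner <- listings_toy %>% inner_join(availability, by = c('id' = 'listing_id'))
joined_full  <- listings_toy %>% full_join(availability,  by = c('id' = 'listing_id'))

# Step 3: keep only listings with at least one available day
active <- joined_left %>% filter(days_available > 0)
```

Here the left join keeps all three listings, with a missing `days_available` for listing 3, which has no calendar rows. The inner join drops that listing, and the full join also brings in listing 4, which appears only in the calendar. Note that `filter()` drops rows where the condition is NA, so listing 3 does not survive the last step.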
ggplot2
Graphical excellence is the well-designed presentation of interesting data – a matter of substance, of statistics, and of design. Graphical excellence consists of complex ideas communicated with clarity, precision, and efficiency.
– Edward Tufte
A grammar is a set of components (ingredients) that you can combine to create complex structures (sentences, recipes, data visualizations). In baking….
gg in ggplot2.
tidyverse
Data: almost always a data_frame
Aesthetic mapping: relation of data to chart components.
Geometry: specific visualization type? E.g. line, bar, heatmap?
Statistical transformation: how should the data be transformed or aggregated before visualizing?
Theme: how should the non-data parts of the plot look?
(+ plays the same role in ggplot2 that %>% does in data manipulation.)

Does getting lots of reviews usually mean you get good reviews?
listings %>%
  ggplot() +
  aes(x = number_of_reviews, y = review_scores_rating) +
  geom_point(alpha = .2)

listings %>%
  ggplot() +
  aes(x = number_of_reviews, y = review_scores_rating) +
  geom_point(alpha = .2) +
  theme_bw()

listings %>%
  filter(number_of_reviews < 100) %>% ##
  ggplot() +
  aes(x = number_of_reviews, y = review_scores_rating) +
  geom_point(alpha = .2) +
  theme_bw()

listings %>%
  filter(number_of_reviews < 100) %>%
  ggplot() +
  aes(x = number_of_reviews, y = review_scores_rating) +
  geom_point(alpha = .2) +
  theme_bw() +
  labs(x = 'Number of Reviews', y = 'Review Score', title = 'Review Volume and Review Quality')

listings %>%
  filter(number_of_reviews < 100) %>%
  ggplot() +
  aes(x = number_of_reviews, y = review_scores_rating) +
  geom_point(alpha = .2, color = 'firebrick') + ##
  theme_bw() +
  labs(x = 'Number of Reviews', y = 'Review Score', title = 'Review Volume and Review Quality')

listings %>%
  filter(number_of_reviews < 100) %>%
  ggplot() +
  aes(x = review_scores_value, ##
      y = review_scores_location, ##
      size = number_of_reviews) + ##
  geom_point(alpha = .2, color = 'firebrick') +
  theme_bw()

listings %>%
  filter(number_of_reviews < 100) %>%
  ggplot() +
  aes(x = review_scores_value,
      y = review_scores_location,
      fill = number_of_reviews) + ##
  geom_tile() + ##
  theme_bw()

The following code computes the average price of all listings on each day in the data set:
average_price_table <- calendar %>%
  mutate(price = price %>% gsub('\\$|,', '', .) %>% as.numeric()) %>%
  group_by(date) %>%
  summarise(mean_price = mean(price, na.rm = TRUE))

Use geom_line() to visualize these prices with time on the x-axis and price on the y-axis.
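One way this exercise might be approached, sketched on a made-up calendar table so that it runs on its own. `calendar_toy`, its values, and the axis labels are hypothetical; the price-cleaning step is the one shown above, which strips dollar signs and thousands separators before converting to numeric.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in for the calendar table; prices are stored as text
calendar_toy <- tibble(
  date = as.Date(c('2016-09-06', '2016-09-06', '2016-09-07', '2016-09-07')),
  price = c('$100.00', '$1,200.00', '$80.00', NA)
)

# Same computation as above, applied to the toy table
average_price_table <- calendar_toy %>%
  mutate(price = price %>% gsub('\\$|,', '', .) %>% as.numeric()) %>%
  group_by(date) %>%
  summarise(mean_price = mean(price, na.rm = TRUE))

# Time on the x-axis, mean price on the y-axis
price_plot <- average_price_table %>%
  ggplot() +
  aes(x = date, y = mean_price) +
  geom_line() +
  theme_bw() +
  labs(x = 'Date', y = 'Mean Price')
```

Printing `price_plot` draws the chart. With the real `calendar` table, only the first step changes: replace `calendar_toy` with `calendar`.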
Using the summary_table object you created earlier, make a bar chart showing the number of apartments by neighbourhood. In this case, the correct geom to use is geom_bar(stat = 'identity').
summary_table %>%
  filter(property_type == 'Apartment') %>%
  ggplot() +
  aes(x = neighbourhood, y = n) +
  geom_bar(stat = 'identity')

summary_table %>%
  filter(property_type == 'Apartment') %>%
  ggplot() +
  aes(x = reorder(neighbourhood, n), y = n) + ##
  coord_flip() + ##
  geom_bar(stat = 'identity')

summary_table %>%
  ggplot() +
  aes(x = reorder(neighbourhood, n), y = n, fill = property_type) + ##
  coord_flip() +
  geom_bar(stat = 'identity')

listings %>%
  filter(number_of_reviews < 100) %>%
  ggplot() +
  aes(x = number_of_reviews,
      y = review_scores_rating,
      color = property_type) + ##
  geom_point(alpha = .5) +
  theme_bw() +
  labs(x = 'Number of Reviews', y = 'Review Score',
       title = 'Review Volume and Review Quality')

listings %>%
  filter(number_of_reviews < 100) %>%
  ggplot() +
  aes(x = number_of_reviews, y = review_scores_rating, color = property_type) +
  geom_point(alpha = .5) +
  theme_bw() +
  facet_wrap(~property_type) + ##
  labs(x = 'Number of Reviews', y = 'Review Score', title = 'Review Volume and Review Quality')

dashboard.rmd, which you can find in the 1_orientation/2_data_science/code directory.
dashboard.html at the provided link.

ggplot2 Cheatsheet
R Graphics Cookbook, by Winston Chang
R packages
R, statistics, and data science at FiveThirtyEight (they use R!)
R Tricks